Markup language

A markup language is a modern system for annotating a text in a way that is syntactically distinguishable from that text. The idea and terminology evolved from the "marking up" of manuscripts, i.e. the revision instructions by editors, traditionally written with a blue pencil on authors' manuscripts). Examples are typesetting instructions such as those found in troff and LaTeX, and structural markers such as XML tags. Markup is typically omitted from the version of the text which is displayed for end-user consumption. Some markup languages, like HTML have presentation semantics, meaning their specification prescribes how the structured data is to be presented, but other markup languages, like XML, have no predefined semantics.

A well-known example of a markup language in widespread use today is HyperText Markup Language (HTML), one of the document formats of the World Wide Web. HTML is mostly an instance of SGML (though, strictly, it does not comply with all the rules of SGML) and follows many of the markup conventions used in the publishing industry in the communication of printed work between authors, editors, and printers.

Contents

Types

There are three general categories of electronic markup: Presentational, procedural, and descriptive.[1][2]

There is considerable blurring of the lines between the types of markup. In modern word-processing systems, presentational markup is often saved in descriptive-markup-oriented systems such as XML, and then processed procedurally by implementations. The programming constructs in descriptive-markup systems such as TeX may be used to create higher-level markup systems which are more descriptive, such as LaTeX.

In recent years, a number of small and largely unstandardized markup languages have been developed to allow authors to create formatted text via web browsers, for use in wikis and web forums. These are sometimes called Lightweight markup languages. The markup language used by Wikipedia is one such.

History

The term markup is derived from the traditional publishing practice of "marking up" a manuscript, which involves adding handwritten annotations in the form of conventional symbolic printer's instructions in the margins and text of a paper manuscript or printed proof. For centuries, this task was done primarily by skilled typographers known as "markup men"[3] or "copy markers"[4] who marked up text to indicate what typeface, style, and size should be applied to each part, and then passed the manuscript to others for typesetting by hand. Markup was also commonly applied by editors, proofreaders, publishers, and graphic designers, and indeed by document authors.

GenCode

The first well-known public presentation of markup languages in computer text processing was made by William W. Tunnicliffe at a conference in 1967, although he preferred to call it generic coding. It can be seen as a response to the emergence of programs such as RUNOFF that each used their own control notations, often specific to the target typesetting device. In the 1970s, Tunnicliffe led the development of a standard called GenCode for the publishing industry and later was the first chair of the International Organization for Standardization committee that created SGML, the first standard descriptive markup language. Book designer Stanley Rice published speculation along similar lines in 1970.[5] Brian Reid, in his 1980 dissertation at Carnegie Mellon University, developed the theory and a working implementation of descriptive markup in actual use.

However, IBM researcher Charles Goldfarb is more commonly seen today as the "father" of markup languages. Goldfarb hit upon the basic idea while working on a primitive document management system intended for law firms in 1969, and helped invent IBM GML later that same year. GML was first publicly disclosed in 1973.

In 1975, Goldfarb moved from Cambridge, Massachusetts to Silicon Valley and became a product planner at the IBM Almaden Research Center. There, he convinced IBM's executives to deploy GML commercially in 1978 as part of IBM's Document Composition Facility product, and it was widely used in business within a few years.

SGML, which was based on both GML and GenCode, was developed by Goldfarb in 1974.[6] Goldfarb eventually became chair of the SGML committee. SGML was first released by ISO as the ISO 8879 standard in October 1986.

Some early examples of computer markup languages available outside the publishing industry can be found in typesetting tools on Unix systems such as troff and nroff. In these systems, formatting commands were inserted into the document text so that typesetting software could format the text according to the editor's specifications. It was a trial and error iterative process to get a document printed correctly. Availability of WYSIWYG ("what you see is what you get") publishing software supplanted much use of these languages among casual users, though serious publishing work still uses markup to specify the non-visual structure of texts, and WYSIWYG editors now usually save documents in a markup-language-based format.

TeX

Another major publishing standard is TeX, created and continuously refined by Donald Knuth in the 1970s and '80s. TeX concentrated on detailed layout of text and font descriptions in order to typeset mathematical books in professional quality. This required Knuth to spend considerable time investigating the art of typesetting. However, TeX has a steep learning curve, so that it is mainly used in academia, where it is the de facto standard in many scientific disciplines. A TeX macro package known as LaTeX provides a descriptive markup system on top of TeX, and is widely used.

Scribe, GML and SGML

The first language to make a clear and clean distinction between structure and presentation was Scribe, developed by Brian Reid and described in his doctoral thesis in 1980.[7] Scribe was revolutionary in a number of ways, not least that it introduced the idea of styles separated from the marked up document, and of a grammar controlling the usage of descriptive elements. Scribe influenced the development of Generalized Markup Language (later SGML) and is a direct ancestor to HTML and LaTeX.

In the early 1980s, the idea that markup should be focused on the structural aspects of a document and leave the visual presentation of that structure to the interpreter led to the creation of SGML. The language was developed by a committee chaired by Goldfarb. It incorporated ideas from many different sources, including Tunnicliffe's project, GenCode. Sharon Adler, Anders Berglund, and James A. Marke were also key members of the SGML committee.

SGML specified a syntax for including the markup in documents, as well as one for separately describing what tags were allowed, and where (the Document Type Definition (DTD) or schema). This allowed authors to create and use any markup they wished, selecting tags that made the most sense to them and were named in their own natural languages. Thus, SGML is properly a meta-language, and many particular markup languages are derived from it. From the late '80s on, most substantial new markup languages have been based on SGML system, including for example TEI and DocBook. SGML was promulgated as an International Standard by International Organization for Standardization, ISO 8879, in 1986.

SGML found wide acceptance and use in fields with very large-scale documentation requirements. However, it was generally found to be cumbersome and difficult to learn, a side effect of attempting to do too much and be too flexible. For example, SGML made end tags (or start-tags, or even both) optional in certain contexts, because it was thought that markup would be done manually by overworked support staff who would appreciate saving keystrokes.

HTML

By 1991, it appeared to many that SGML would be limited to commercial and data-based applications while WYSIWYG tools (which stored documents in proprietary binary formats) would suffice for other document processing applications. The situation changed when Sir Tim Berners-Lee, learning of SGML from co-worker Anders Berglund and others at CERN, used SGML syntax to create HTML. HTML resembles other SGML-based tag languages, although it began as simpler than most and a formal DTD was not developed until later. Steven DeRose[8] argues that HTML's use of descriptive markup (and SGML in particular) was a major factor in the success of the Web, because of the flexibility and extensibility that it enabled (other factors include the notion of URLs and the free distribution of browsers). HTML is quite likely the most used markup language in the world today.

Some would restrict the term "markup language" to systems that directly support non-hierarchical structures (see Hierarchical model). By this definition HTML, XML, and even SGML (apart from its rarely used CONCUR option) would be disqualified and called "container languages" instead. However, the term "container language" is not in widespread use, and such hierarchical languages are almost universally considered markup languages. There is active research on non-hierarchical markup models, some expressed within XML and related languages (for example, using the Text Encoding Initiative Guidelines and derivatives such as the Open Scripture Information Standard and CLIX), and some not (for example, MECS and the Layed Markup and Annotation Language or LMNL). Much of this research is published in the proceedings of the Extreme Markup and Balisage conferences, generally held in Montreal.

XML

XML (Extensible Markup Language) is a meta markup language that is now widely used. XML was developed by the World Wide Web Consortium, in a committee created and chaired by Jon Bosak. The main purpose of XML was to simplify SGML by focusing on a particular problem — documents on the Internet.[9] XML remains a meta-language like SGML, allowing users to create any tags needed (hence "extensible") and then describing those tags and their permitted uses.

XML adoption was helped because every XML document can be written in such a way that it is also an SGML document, and existing SGML users and software could switch to XML fairly easily. However, XML eliminated many of the more complex and human-oriented features of SGML to simplify implementation environments such as documents and publications. However, it appeared to strike a happy medium between simplicity and flexibility, and was rapidly adopted for many other uses. XML is now widely used for communicating data between applications. Like HTML, it can be described as a 'container' language.

XHTML

Since January 2000 all W3C Recommendations for HTML have been based on XML rather than SGML, using the abbreviation XHTML (Extensible HyperText Markup Language). The language specification requires that XHTML Web documents must be well-formed XML documents – this allows for more rigorous and robust documents while using tags familiar from HTML.

One of the most noticeable differences between HTML and XHTML is the rule that all tags must be closed: empty HTML tags such as <br> must either be closed with a regular end-tag, or replaced by a special form: <br /> (the space before the '/' on the end tag is optional, but frequently used because it enables some pre-XML Web browsers, and SGML parsers, to accept the tag). Another is that all attribute values in tags must be quoted. Finally, all tag and attribute names must be lowercase in order to be valid; HTML, on the other hand, was case-insensitive.

Other XML-based applications

Many XML-based applications now exist, including Resource Description Framework (RDF), XForms, DocBook, SOAP and the Web Ontology Language (OWL). For a partial list of these see List of XML markup languages.

Features

A common feature of many markup languages is that they intermix the text of a document with markup instructions in the same data stream or file. This is not necessary; it is possible to isolate markup from text content, using pointers, offsets, IDs, or other methods to co-ordinate the two. Such "standoff markup" is typical for the internal representations that programs use to work with marked-up documents. However, embedded or "inline" markup is much more common elsewhere. Here, for example, is a small section of text marked up in HTML:

<h1> Anatidae </h1>
<p>
The family <i>Anatidae</i> includes ducks, geese, and swans,
but <em>not</em> the closely related screamers.
</p>

The codes enclosed in angle-brackets <like this> are markup instructions (known as tags), while the text between these instructions is the actual text of the document. The codes h1, p, and em are examples of semantic markup, in that they describe the intended purpose or meaning of the text they include. Specifically, h1 means "this is a first-level heading", p means "this is a paragraph", and em means "this is an emphasized word or phrase". A program interpreting such structural markup may apply its own rules or styles for presenting the various pieces of text, using diffent typefaces, boldness, font size, indention, colour, or other styles, as desired. A tag such as "h1" (header level 1) might be presented in a large bold sans-serif typeface, for example, or in a monospaced (typewriter-style) document it might be underscored – or it might not change the presentation at all.

In contrast, the i tag in HTML is an example of presentational markup; it is generally used to specify a particular characteristic of the text (in this case, the use of an italic typeface) without specifying the reason for that appearance.

The Text Encoding Initiative (TEI) has published extensive guidelines[10] for how to encode texts of interest in the humanities and social sciences, developed through years of international cooperative work. These guidelines are used by projects encoding historical documents, the works of particular scholars, periods, or genres, and so on.

Alternative usage

While the idea of markup language originated with text documents, there is an increasing usage of markup languages in other areas which involve the presentation of various types of information, including playlists, vector graphics, web services, content syndication, and user interfaces. Most of these are XML applications because it is a well-defined and extensible language.

The use of XML has also led to the possibility of combining multiple markup languages into a single profile, like XHTML+SMIL and XHTML+MathML+SVG[11]

Because markup languages, and more generally data description languages (not necessarily textual markup), are not programming languages[12] (they are data without instructions), they are more easily manipulated than programming languages – for example, web pages are presented as HTML documents, not C code, and thus can be embedded within other web pages, displayed when only partially received, and so forth. This leads to the web design principle of the "Rule of Least Power", which advocates using the least (computationally) powerful language that satisfies a task to facilitate such manipulation and reuse.

See also

References

  1. ^ Coombs, James H.; Renear, Allen H.; DeRose, Steven J. (November 1987). "Markup systems and the future of scholarly text processing". Communications of the ACM (ACM) 30 (11): 933–947. doi:10.1145/32206.32209. http://xml.coverpages.org/coombs.html. 
  2. ^ "Taxonomy of Markup". http://www.tbray.org/ongoing/When/200x/2003/04/09/SemanticMarkup#p-1. 
  3. ^ Allan Woods, Modern Newspaper Production (New York: Harper & Row, 1963), 85; Stewart Harral, Profitable Public Relations for Newspapers (Ann Arbor: J.W. Edwards, 1957), 76; and Chiarella v. United States, 445 U.S. 222 (1980).
  4. ^ From the Notebooks of H.J.H & D.H.A on Composition, Kingsport Press Inc., undated (1960s).
  5. ^ Rice, Stanley. “Editorial Text Structures (with some relations to information structures and format controls in computerized composition).” American National Standards Institute, March 17, 1970.
  6. ^ "2009 interview with SGML creator Charles F. Goldfarb". Dr. Dobb's Journal. http://www.drdobbs.com/blog/archives/2009/08/beyond_html_an.html. Retrieved 2010-07-18. 
  7. ^ Reid, Brian. "Scribe: A Document Specification Language and its Compiler." Ph.D. thesis, Carnegie-Mellon University, Pittsburgh PA. Also available as Technical Report CMU-CS-81-100.
  8. ^ DeRose, Steven J. "The SGML FAQ Book." Boston: Kluwer Academic Publishers, 1997. ISBN 0-7923-9943-9
  9. ^ Extensible Markup Language (XML)
  10. ^ http://www.tei-c.org/Guidelines2/index.html
  11. ^ An XHTML + MathML + SVG Profile". W3C, August 9, 2002. Retrieved on 17 March 2007.
  12. ^ Korpela, Jukka (2005-11-16). "Programs vs. markup". IT and communication. Tampere University of Technology. http://www.cs.tut.fi/~jkorpela/prog.html. Retrieved 2011-01-08. 

External links